Author: Jessica Marx
Date: 19 November 2018
Dataset:
November 8-16, 2018; tweets mentioning Nordstrom (and variations).
Jira Story: NORDACE-8398
Code: R
Purpose:
The DA team recently attended an R conference, which featured several talks on text mining and sentiment analysis, including one from the author of the tidytext package. We wanted to take what we learned and apply it to Nordstrom and how the company is discussed in the Twitter-verse.
Methodology:
Using the rtweet package, we pulled the maximum number of recent records mentioning Nordstrom (and all relevant variations – #nordstrom, @nordstrom, etc.), including retweets.
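A minimal sketch of that pull, assuming a standard-API token is already set up; the exact query string and limits here are assumptions, not the team's actual script (the standard search endpoint returns up to roughly 18,000 tweets from about the last week per call):

```r
library(rtweet)

# Pull recent tweets mentioning Nordstrom and common variations.
# include_rts = TRUE keeps retweets, as described above.
nordstrom_tweets <- search_tweets(
  q = "nordstrom OR #nordstrom OR @nordstrom",
  n = 18000,
  include_rts = TRUE,
  lang = "en"
)
```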
Results:
The following is meant to be a demonstration of the various types of text analyses that can be done with text mining and R.
First off, we used the tidytext package to separate each tweet by word, eliminate non-meaningful words (AKA “stop words”), and join words to the “Bing” sentiment lexicon. Note that “Bing” is binary – words are classified as either “Positive” or “Negative.”
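A sketch of that pipeline, assuming the pulled tweets sit in a data frame `tweets` with `created_at` and `text` columns (those names match rtweet's output, but `tweets` itself is an assumed variable name):

```r
library(dplyr)
library(tidytext)

# One row per word, with stop words ("the", "and", ...) removed.
tweet_words <- tweets %>%
  mutate(day = as.Date(created_at)) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Join to the binary "Bing" lexicon and count sentiment by day.
bing_by_day <- tweet_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(day, sentiment)
```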
This seems clear enough. But what words are contributing to each sentiment?
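One way to answer that, assuming the tokenized, stop-word-free data frame `tweet_words` from the step above:

```r
library(dplyr)
library(tidytext)

# Top ten contributing words per Bing sentiment.
contributors <- tweet_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup()
```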
Most of these words seem correctly applied to their sentiment, but a few stand out as ambiguous or misclassified:
- free
- trump
- fall
We can remove them as neutral words (for now) and return to our charts showing sentiments by day.
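A small sketch of treating those words as neutral, again assuming the `tweet_words` data frame:

```r
library(dplyr)

# Custom list of ambiguous words to exclude from sentiment joins.
custom_neutral <- tibble::tibble(word = c("free", "trump", "fall"))

tweet_words_clean <- tweet_words %>%
  anti_join(custom_neutral, by = "word")
```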
We can also use the “NRC” lexicon to track additional sentiments:
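The NRC lexicon tags words with emotion categories (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) as well as positive/negative; a word can carry several tags, so the join can produce multiple rows per word. Assuming `tweet_words` as before:

```r
library(dplyr)
library(tidytext)

nrc_by_day <- tweet_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(day, sentiment)
```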
Everyone loves a word cloud!
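A cloud of the most frequent (non-stop) words can be drawn with the wordcloud package; `max.words = 100` here is just an illustrative cap:

```r
library(dplyr)
library(wordcloud)

tweet_words %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
```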
Term Frequency
We can look at term frequency by day. This just shows us that every day there are a few words that are used extremely frequently and many that are used infrequently, which intuitively makes sense.
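Term frequency here is each word's share of all words used that day. A sketch, assuming `tweet_words` with a `day` column; the x-axis cap is an assumption to trim the long right tail:

```r
library(dplyr)
library(ggplot2)

word_freq <- tweet_words %>%
  count(day, word) %>%
  group_by(day) %>%
  mutate(tf = n / sum(n))

# Histogram of term frequency, faceted by day: a few very common
# words on the right, many rare words piled up near zero.
ggplot(word_freq, aes(tf)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.01) +
  facet_wrap(~ day)
```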
Zipf’s Law
Illustrating the relationship between how frequently a word is used and its rank with Zipf’s law (which states that the frequency with which a word appears is inversely proportional to its rank). FYI: George Zipf was a 20th-century American linguist.
The deviations at low rank mean that people who tweet about Nordstrom use a lower percentage of the most common words than Zipf’s law would predict.
Here is the same graph, but with an approximation of the slope:
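The slope can be approximated by fitting a line on the log-log scale; a slope near -1 is the classic Zipf relationship. This assumes the `word_freq` data frame from the term-frequency section (per-day counts `n` and frequencies `tf`):

```r
library(dplyr)

ranked <- word_freq %>%
  group_by(day) %>%
  mutate(rank = row_number(desc(n)))

# log10(tf) ~ log10(rank): the slope estimates the Zipf exponent.
fit <- lm(log10(tf) ~ log10(rank), data = ranked)
coef(fit)
```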
TF-IDF
The idea of tf-idf is to find the important words – words that stand out – by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common.
November 8 - 16, 2018:
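tidytext's `bind_tf_idf()` computes this directly; here each day is treated as a "document," assuming `tweet_words` as before:

```r
library(dplyr)
library(tidytext)

tweet_tfidf <- tweet_words %>%
  count(day, word) %>%
  bind_tf_idf(word, day, n) %>%   # term = word, document = day
  arrange(desc(tf_idf))
```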
Tokenizing with ngrams
When a word is preceded by a negating word, its meaning becomes its inverse. Let’s take a look using the “AFINN” lexicon, which assigns each word an integer sentiment score from -5 to +5. Which words in the previous charts carried the opposite meaning when seen in this context?
Not…
Note that we multiplied the sentiment score by -1 to reflect the inverse impact of the word “not.” Here is the same concept illustrated with a few more negating words:
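A sketch of the bigram approach, assuming the raw `tweets` data frame; the negator list is illustrative. (In current tidytext releases the AFINN score column is named `value`; older releases called it `score`.)

```r
library(dplyr)
library(tidyr)
library(tidytext)

negation_words <- c("not", "no", "never", "without")

negated <- tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% negation_words) %>%
  inner_join(get_sentiments("afinn"), by = c("word2" = "word")) %>%
  count(word1, word2, value, sort = TRUE) %>%
  mutate(contribution = -1 * value * n)  # flip the sign for negation
```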
Certain adjectives and adverbs are used for emphasis – very, really, extremely, only, actually, so… etc. We can double the score of the second word to reflect the impact of the first.
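The same bigram machinery handles intensifiers; doubling the AFINN score is the weighting described above, and the intensifier list is an assumption:

```r
library(dplyr)
library(tidyr)
library(tidytext)

intensifiers <- c("very", "really", "extremely", "only", "actually", "so")

emphasized <- tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% intensifiers) %>%
  inner_join(get_sentiments("afinn"), by = c("word2" = "word")) %>%
  mutate(weighted = 2 * value)   # double the score to reflect emphasis
```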
Network Analysis
Finally, we can visualize a network of bigrams (paired words):
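One way to draw such a network with igraph and ggraph, assuming the raw `tweets` data frame; the `n > 20` threshold is an assumed cutoff to keep only common pairs:

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)
library(ggraph)

# Count bigrams with stop words removed from both positions.
bigram_counts <- tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  anti_join(stop_words, by = c("word1" = "word")) %>%
  anti_join(stop_words, by = c("word2" = "word")) %>%
  count(word1, word2, sort = TRUE)

# Build a graph from the frequent pairs and plot it.
bigram_graph <- bigram_counts %>%
  filter(n > 20) %>%
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```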
Thanks for reading through; this project was fascinating to work on and we look forward to applying text mining to more areas within Nordstrom.